DiviML: A Module-based Heuristic for Mapping Neural Networks onto Heterogeneous Platforms
Datacenters are increasingly becoming heterogeneous, and are starting to
include specialized hardware for networking, video processing, and especially
deep learning. To leverage the heterogeneous compute capability of modern
datacenters, we develop an approach for compiler-level partitioning of deep
neural networks (DNNs) onto multiple interconnected hardware devices. We
present a general framework for heterogeneous DNN compilation, offering
automatic partitioning and device mapping. Our scheduler integrates both an
exact solver, through a mixed integer linear programming (MILP) formulation,
and a modularity-based heuristic for scalability. Furthermore, we propose a
theoretical lower bound formula for the optimal solution, which enables the
assessment of the heuristic solutions' quality. We evaluate our scheduler in
optimizing both conventional DNNs and randomly-wired neural networks, subject
to latency and throughput constraints, on a heterogeneous system comprising a
CPU and two distinct GPUs. Compared to naïvely running DNNs on the fastest
GPU, the proposed framework can achieve more than 3× lower latency
and up to 2.9× higher throughput by automatically leveraging both data
and model parallelism to deploy DNNs on our sample heterogeneous server node.
Moreover, our modularity-based "splitting" heuristic improves the solution
runtime by up to 395× without noticeably sacrificing solution quality
compared to an exact MILP solution, and outperforms all other heuristics by
30-60% in solution quality. Finally, our case study shows how we can extend our
framework to schedule large language models across multiple heterogeneous
servers by exploiting symmetry in the hardware setup. Our code can be easily
plugged in to existing frameworks, and is available at
https://github.com/abdelfattah-lab/diviml.Comment: accepted at ICCAD'2
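As a rough illustration of the scheduling problem this abstract describes, a minimal greedy partitioner for a linear chain of layers might look like the sketch below. This is not the paper's MILP formulation or modularity heuristic; the device speeds, layer costs, and fixed transfer penalty are all hypothetical.

```python
# Toy greedy mapper for a linear chain of DNN layers onto heterogeneous
# devices. Each layer goes to the device that finishes it earliest, given
# per-device speed and a fixed transfer penalty whenever consecutive layers
# land on different devices. Purely illustrative.

def map_layers(layer_costs, device_speeds, transfer_cost=1.0):
    """layer_costs: work per layer; device_speeds: relative speed per device.
    Returns (per-layer device assignment, estimated end-to-end latency)."""
    ready = [0.0] * len(device_speeds)   # time each device becomes free
    assignment, prev_dev, t_prev = [], None, 0.0
    for work in layer_costs:
        best_dev, best_finish = None, float("inf")
        for d, speed in enumerate(device_speeds):
            # Pay the transfer penalty only when switching devices.
            hop = transfer_cost if prev_dev not in (None, d) else 0.0
            start = max(ready[d], t_prev + hop)
            finish = start + work / speed
            if finish < best_finish:
                best_dev, best_finish = d, finish
        ready[best_dev] = best_finish
        assignment.append(best_dev)
        prev_dev, t_prev = best_dev, best_finish
    return assignment, t_prev

# Example: 4 layers on a slow CPU (speed 1) and a fast GPU (speed 4).
assign, latency = map_layers([4.0, 8.0, 2.0, 6.0], [1.0, 4.0])
```

With these numbers the greedy rule keeps every layer on the faster device, since the transfer penalty outweighs any gain from the slow CPU; a real scheduler (as in the abstract) must also weigh model-parallel splits, which is exactly where the MILP and the lower-bound formula earn their keep.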
BRAMAC: Compute-in-BRAM Architectures for Multiply-Accumulate on FPGAs
Deep neural network (DNN) inference using reduced integer precision has been
shown to achieve significant improvements in memory utilization and compute
throughput with little or no accuracy loss compared to full-precision
floating-point. Modern FPGA-based DNN inference relies heavily on the on-chip
block RAM (BRAM) for model storage and the digital signal processing (DSP) unit
for implementing the multiply-accumulate (MAC) operation, a fundamental DNN
primitive. In this paper, we enhance the existing BRAM to also compute MAC by
proposing BRAMAC (Compute-in-BRAM Architectures for Multiply-Accumulate).
BRAMAC supports
2's complement 2- to 8-bit MAC in a small dummy BRAM array using a hybrid
bit-serial & bit-parallel data flow. Unlike previous compute-in-BRAM
architectures, BRAMAC allows read/write access to the main BRAM array while
computing in the dummy BRAM array, enabling both persistent and tiling-based
DNN inference. We explore two BRAMAC variants: BRAMAC-2SA (with 2 synchronous
dummy arrays) and BRAMAC-1DA (with 1 double-pumped dummy array).
BRAMAC-2SA/BRAMAC-1DA can boost the peak MAC throughput of a large Arria-10
FPGA by 2.6×/2.1×, 2.3×/2.0×, and 1.9×/1.7× for 2-bit, 4-bit, and 8-bit
precisions, respectively, at the cost of a 6.8%/3.4% increase in the FPGA core
area. By adding
BRAMAC-2SA/BRAMAC-1DA to a state-of-the-art tiling-based DNN accelerator, an
average speedup of 2.05×/1.7× and 1.33×/1.52× can
be achieved for AlexNet and ResNet-34, respectively, across different model
precisions.
Comment: 11 pages, 13 figures, 3 tables, FCCM conference 202
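The hybrid bit-serial data flow mentioned in this abstract can be modeled in software: a dot product over 2's-complement activations decomposes into one partial sum per activation bit-plane, with the sign bit contributing a negative weight. The sketch below is a didactic model only, not BRAMAC's circuit, and the operand values are made up.

```python
# Didactic model of a bit-serial MAC over 2's-complement activations:
# the dot product is rebuilt from one partial sum per activation bit,
# with the MSB (sign bit) weighted negatively.

def bit_serial_dot(activations, weights, bits=4):
    """Dot product of `bits`-wide 2's-complement activations with weights,
    accumulated one activation bit-plane at a time."""
    total = 0
    for b in range(bits):
        # Partial sum for bit-plane b: weights of activations with bit b set.
        partial = sum(w for a, w in zip(activations, weights) if (a >> b) & 1)
        weight_of_bit = -(1 << b) if b == bits - 1 else (1 << b)
        total += weight_of_bit * partial
    return total

# 2's-complement encoder for the 4-bit example below.
enc = lambda x, bits=4: x & ((1 << bits) - 1)

acts = [3, -2, 7, -8]          # signed 4-bit activations
ws = [1, 5, -3, 2]
encoded = [enc(a) for a in acts]
ref = sum(a * w for a, w in zip(acts, ws))
assert bit_serial_dot(encoded, ws) == ref  # matches a direct MAC
```

Serializing the activation bits while keeping the weights parallel is what lets a narrow in-BRAM adder stand in for a full-width multiplier, which is the intuition behind trading a few cycles for very cheap compute next to the storage.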
Effect of sex on meat quality characteristics
The goal of the current work was to determine the quality of meat from male and female cattle and buffalo, as well as to trial an improvement of meat quality by feeding experimental groups of cattle and buffalo for four months on rations containing 16.5% protein. During 2021, eighty samples of cattle and buffalo meat were obtained from butcher shops in Luxor, Egypt. The samples were divided into four categories: male cattle, female cattle, male buffalo, and female buffalo; each class was represented by 20 samples. Trials for improving the nutritional content of the meat were conducted by feeding a ration containing 16.5% protein to male cattle and buffalo, with each class represented by 10 animals. The samples were analyzed to determine moisture, protein, fat, ash, carbohydrate, energy percentage, cooking loss, water-holding capacity, and tenderness, and cholesterol was determined in perinephric fat. Male beef was characterized by a greater protein content (18.02% ± 0.35%). On the other hand, male buffalo meat was characterized by low fat content and cholesterol levels of 1.60% ± 0.85% and 294.30 ± 2.40 mg/100 gm, respectively. The experimental male cattle showed the highest protein percentage (18.50% ± 0.37%) and the lowest cholesterol level (267.19 ± 6.25 mg/100 gm). The use of the experimental ration could improve the quality of male beef in terms of protein value as well as cholesterol level.
The power of communication: Energy-efficient NoCs for FPGAs
Integrating networks-on-chip (NoCs) on FPGAs can improve device scalability and facilitate design by abstracting communication and simplifying timing closure, not only between modules in the FPGA fabric but also with large “hard” blocks such as high-speed I/O interfaces. We propose mixed and hard NoCs that add less than 1% area to large FPGAs and run 5-6× faster than the soft NoC equivalent. A detailed power analysis, per NoC component, shows that routers consume 14× less power when implemented hard compared to soft, and that whether hard or soft, most of the router’s power is consumed in the input modules for buffering. For complete systems, hard NoCs consume less than 6% (and as low as 3%) of the FPGA’s dynamic power budget to support 100 GB/s of communication bandwidth. We find that, depending on design choices, hard NoCs consume 4.5-10.4 mJ of energy per GB of data transferred. Surprisingly, this is comparable to the energy efficiency of the simplest traditional interconnect on an FPGA: soft point-to-point links require 4.7 mJ/GB. In many designs, communication must include multiplexing, arbitration and/or pipelining. For all these cases, our results indicate that a hard NoC will be more energy efficient than the conventional FPGA fabric.
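The energy figures in this abstract translate directly into power via power = bandwidth × energy-per-GB. A back-of-the-envelope check (not a measured number from the paper):

```python
# NoC interconnect power = bandwidth (GB/s) * energy per GB transferred.
bandwidth_gb_s = 100           # GB/s, from the abstract
for mj_per_gb in (4.5, 10.4):  # hard-NoC energy range from the abstract
    watts = bandwidth_gb_s * mj_per_gb * 1e-3   # mJ -> J
    print(f"{mj_per_gb} mJ/GB at {bandwidth_gb_s} GB/s -> {watts:.2f} W")
```

So sustaining the full 100 GB/s costs on the order of 0.5-1 W of interconnect power, consistent with the abstract's claim that this stays below 6% of the FPGA's dynamic power budget.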
Design tradeoffs for hard and soft FPGA-based Networks-on-Chip
Networks-on-chip (NoCs) on FPGAs have the potential not only to improve the efficiency of the interconnect, but also to increase designer productivity and reduce compile time by raising the abstraction level of communication. By comparing NoC components on FPGAs and ASICs, we quantify the efficiency gap between the two platforms and use the results to understand the design tradeoffs in that space. The crossbar has the largest FPGA vs. ASIC gaps: 85× area and 4.4× delay, while the input buffers have the smallest: 17× area and 2.9× delay. For a soft NoC router, these results indicate that wide datapaths, deep buffers, and a small number of ports and virtual channels (VCs) are favorable for FPGA implementation. If one hardens a complete state-of-the-art VC router, it is on average 30× more area efficient and can achieve 3.6× the maximum frequency of a soft implementation. We show that this hard router can be integrated with the soft FPGA interconnect and still achieve an area improvement of 22×. A 64-node NoC of hard routers with soft interconnect utilizes area equivalent to 1.6% of the logic modules in the latest FPGAs, compared to 33% for a soft NoC.
Are We There Yet? Product Quantization and its Hardware Acceleration
Conventional multiply-accumulate (MAC) operations have long dominated
computation time for deep neural networks (DNNs). Recently, product
quantization (PQ) has been successfully applied to these workloads, replacing
MACs with memory lookups to pre-computed dot products. While this property
makes PQ an attractive solution for model acceleration, little is understood
about the associated trade-offs in terms of compute and memory footprint, and
the impact on accuracy. Our empirical study investigates the impact of
different PQ settings and training methods on layerwise reconstruction error
and end-to-end model accuracy. When studying the efficiency of deploying PQ
DNNs, we find that metrics such as FLOPs, number of parameters, and even
CPU/GPU performance, can be misleading. To address this issue, and to more
fairly assess PQ in terms of hardware efficiency, we design the first custom
hardware accelerator to evaluate the speed and efficiency of running PQ models.
We identify PQ configurations that are able to improve performance-per-area for
ResNet20 by 40%-104%, even when compared to a highly optimized conventional DNN
accelerator. Our hardware performance outperforms recent PQ solutions by 4×,
with only a 0.6% accuracy degradation. This work demonstrates the practical and
hardware-aware design of PQ models, paving the way for wider adoption of this
emerging DNN approximation methodology.
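The MAC-to-lookup substitution this abstract describes can be sketched for a single linear layer as follows. The dimensions, random codebook, and nearest-codeword encoder below are all hypothetical; real PQ DNNs learn the codebooks, and the paper's contribution is the custom accelerator, not this software loop.

```python
import numpy as np

# Toy product quantization for a linear layer: the input is split into
# subvectors, each subvector is snapped to its nearest codeword, and the
# output is assembled from *pre-computed* codeword-weight dot products,
# so memory lookups replace the MACs at inference time.

rng = np.random.default_rng(0)
d, subvecs, codes, out = 8, 4, 16, 3   # input dim, subspaces, codebook size, outputs
sub_d = d // subvecs
W = rng.normal(size=(out, d))                      # layer weights
codebook = rng.normal(size=(subvecs, codes, sub_d))

# Offline: pre-compute every codeword's dot product with every weight slice.
# lut[s, c, o] = codebook[s, c] . W[o, s*sub_d:(s+1)*sub_d]
lut = np.einsum('scd,osd->sco', codebook, W.reshape(out, subvecs, sub_d))

def pq_forward(x):
    y = np.zeros(out)
    for s in range(subvecs):
        xs = x[s * sub_d:(s + 1) * sub_d]
        c = np.argmin(((codebook[s] - xs) ** 2).sum(axis=1))  # nearest codeword
        y += lut[s, c]                                        # lookup, no MACs
    return y

x = rng.normal(size=d)
approx = pq_forward(x)   # approximates the dense product W @ x
```

The inference cost per subvector is one nearest-neighbor search plus one table read, which is why FLOP counts and parameter counts mislead: the real bottleneck shifts to lookup bandwidth, exactly the effect the custom accelerator in the abstract is built to measure.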
α-Globin Messenger Ribonucleic Acid as a Molecular Marker for Determining the Age of Human Blood Spots in Different Temperatures
Background: Analyzing recovered evidence, such as blood which is one of the most encountered types of biological evidence, can provide information to establish the definite time when a crime was committed. This study aims to investigate the time- and temperature-related effects on human bloodstain’s α-globin messenger RNA expression and to estimate the bloodstain’s age using α-globin mRNA. Methods: A total of 22 blood samples were collected from healthy middle-aged volunteers (12 women and 10 men). After preparation, the samples were exposed to temperatures of 4°C, 24°C, and 40°C. Next, the mRNA expression of the α-globin gene was quantified by real-time RT-PCR at different time intervals of 0, 30, 90, and 150 days.Results: The α-globin gene expression showed the highest mean values by 0 day and at 4°C and the lowest mean values by 150 days and at 40°C. Samples from male participants showed higher mean values of α-globin gene expression compared to their female counterparts. A significant negative correlation was detected between α-globin gene expression and time interval. Meanwhile, a regression equation was formulated to estimate the time interval using the α-globin gene concentration.Conclusion: α-Globin mRNA could be a useful marker to estimate the age of human blood spots
Zero-Cost Proxies Meet Differentiable Architecture Search
Differentiable neural architecture search (NAS) has attracted significant
attention in recent years due to its ability to quickly discover promising
architectures of deep neural networks even in very large search spaces. Despite
its success, differentiable architecture search (DARTS) lacks robustness in
certain cases, e.g., it may degenerate to trivial architectures with an excess
of parameter-free operations such as skip connections or random noise, leading
to inferior performance. In particular, operation selection based on the
magnitude of architectural parameters was recently proven to be fundamentally
wrong, showcasing the need to rethink this
aspect. On the other hand, zero-cost proxies have been recently studied in the
context of sample-based NAS showing promising results -- speeding up the search
process drastically in some cases but also failing on some of the large search
spaces typical for differentiable NAS. In this work we propose a novel
operation selection paradigm in the context of differentiable NAS which
utilises zero-cost proxies. Our perturbation-based zero-cost operation
selection (Zero-Cost-PT) improves searching time and, in many cases, accuracy
compared to the best available differentiable architecture search, regardless
of the search space size. Specifically, we are able to find comparable
architectures to DARTS-PT on the DARTS CNN search space while being over 40×
faster (total search time: 25 minutes on a single GPU).
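The perturbation-based selection idea can be illustrated with a toy: score each candidate operation on an edge by how much masking it out degrades some evaluation signal, then keep the operation whose removal hurts most. The `evaluate` function below is a hypothetical stand-in for the zero-cost proxy score (or supernet validation accuracy) that the real method would use.

```python
# Toy perturbation-based operation selection: instead of ranking candidate
# operations by architecture-parameter magnitude, rank them by how much the
# evaluation score drops when each one is masked out.

def select_op(ops, evaluate):
    """Return the op whose removal causes the largest score drop."""
    base = evaluate(ops)
    drops = {}
    for op in ops:
        remaining = [o for o in ops if o != op]
        drops[op] = base - evaluate(remaining)
    return max(drops, key=drops.get)

# Hypothetical per-op score contributions: 'conv3x3' carries the accuracy.
scores = {'conv3x3': 0.50, 'skip': 0.10, 'noise': 0.01}
evaluate = lambda ops: sum(scores[o] for o in ops)
best = select_op(['conv3x3', 'skip', 'noise'], evaluate)  # -> 'conv3x3'
```

Note how a magnitude-based rule could still pick `skip` if its architectural parameter happened to be largest; the perturbation criterion instead asks what each operation actually contributes, and plugging in a zero-cost proxy as `evaluate` is what makes the search cheap.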